feat[vortex-cuda]: GPU FSST decompression kernel #7776
Conversation
Merging this PR will degrade performance by 10.6%

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | bitwise_not_vortex_buffer_mut[128] | 246.1 ns | 275.3 ns | -10.6% |
Comparing asubiotto/fsst-cuda (bebed3a) with develop (903ee6c)
Force-pushed from c982cd8 to 007bdab
Force-pushed from a10b28e to 79d1f10
Very cool!
Force-pushed from 79d1f10 to 46670e8
Thanks for the review! Addressed the comments.
0ax1 left a comment:
Fantastic contribution, one question inline.
```rust
// Prefix-sum lens to per-string u32 output offsets so the kernel
// knows where to write each decoded string.
#[expect(clippy::cast_possible_truncation)]
```
This looks unsound to me
Yes, this is leftover from the hackathon. Ignore. I will template on output offsets.
Or just use u64, but that'd be a little wasteful.
I chose to just use u64, not sure templating is worth it. LMK and I can change the approach.
```rust
let output_offsets: Vec<u32> = match_each_integer_ptype!(lens.ptype(), |P| {
    let mut out = Vec::with_capacity(lens.len() + 1);
    let mut acc: usize = 0;
    out.push(0u32);
    for &l in lens.as_slice::<P>() {
        acc += l as usize;
        out.push(acc as u32);
    }
    out
});
```
Would be nice if this were computed in the kernel too?
How would you suggest doing this? The kernel needs to know the output offsets in order to decode, and I think the only thing that would work would be an O(n²) prefix sum of all lengths for every string. It's probably cheaper to execute this linear computation, with its serial dependencies, on the CPU. A GSST encoding would obviate the need for this, so I would just punt on it.
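For concreteness, a minimal sketch of the truncation-free u64 variant that was settled on, mirroring the snippet quoted above (the landed code may differ in detail):

```rust
// Hedged sketch: the same prefix sum as above, but accumulating into u64 so
// no cast can truncate; the trade-off is 2x the offset memory.
let output_offsets: Vec<u64> = match_each_integer_ptype!(lens.ptype(), |P| {
    let mut out = Vec::with_capacity(lens.len() + 1);
    let mut acc: u64 = 0;
    out.push(0u64);
    for &l in lens.as_slice::<P>() {
        acc += l as u64;
        out.push(acc);
    }
    out
});
```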
Force-pushed from 46670e8 to bebed3a
Updated to use u64 output offsets and cleaned up the kernel parameters into an args struct.
Gave the benchmark a spin also on my end, on a GH200 (which has 4 TB/s memory bandwidth): [benchmark output screenshot]

So the main bottleneck for this kernel is warp divergence: [profiler screenshot]

For comparison, this is e.g. what we get for runend: [profiler screenshot]
```rust
use crate::timed_launch_strategy::TimedLaunchStrategy;

const BENCH_SIZES: &[(usize, &str)] = &[(1_000_000, "1M"), (5_000_000, "5M"), (10_000_000, "10M")];
```
At this point we run all benchmarks for vortex-cuda with 100M elements and reuse the constant from vortex-cuda/benches/bench_config/mod.rs. 10M inputs proved too noisy when running the benchmarks in codspeed and can't saturate larger GPUs like a GH200.
Yeah the difference is that with FSST each input is a clickbench URL. Happy to use 100M if you want, but I think that's around 10GB.
Yeah fair, we can keep 10M, but let's add a comment on the constant explaining why it's different here.
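Something along these lines on the constant, say (wording and placement are just a suggestion):

```rust
// Unlike the other vortex-cuda benches, which reuse the shared 100M-element
// constant from benches/bench_config/mod.rs, each FSST input here is a
// clickbench URL, so 100M inputs would be roughly 10 GB of data. Smaller
// sizes keep the benchmark practical.
const BENCH_SIZES: &[(usize, &str)] = &[(1_000_000, "1M"), (5_000_000, "5M"), (10_000_000, "10M")];
```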
Oh, we also need to wire the bench into codspeed in .github/workflows/codspeed.yml. We can put it next to zstd, which is nice for comparison: `- { shard: 7, name: "Encodings 4", packages: "vortex-sparse vortex-zigzag vortex-zstd" }`.
Nice, wish I had one of those lying around 😂 Yes, that's the main idea behind the GSST splits. Precomputing splits on the CPU showed +30% kernel throughput at 1M clickbench URLs, +26% at 5M, and +12% at 10M on the DGX Spark. While we can precompute splits, I think it's maybe a little too much CPU prep, and it's rather a motivation for the GSST encoding? Happy to introduce splits in a follow-up PR if we want to do that.
Heh, I mean this is clearly not a blocker, but I wanted to pin down the exact numbers. We can definitely land as-is from a perf standpoint.
```rust
let host_bytes = CudaDeviceBuffer::new(device_output)
    .copy_to_host(Alignment::new(1))?
    .await?;
let host_bytes = host_bytes.slice(0..total_size);
```
```rust
let (buffers, views) = match_each_integer_ptype!(lens.ptype(), |P| {
    build_views(
        0,
        MAX_BUFFER_LEN,
        host_bytes.into_mut(),
        lens.as_slice::<P>(),
    )
});
```
Can we build the views on the GPU?
Probably, since we already compute the output offsets. It's just a question of modding by `i32::MAX` to split up the output into a couple of buffers.
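A minimal sketch of that splitting, assuming output buffers capped at `i32::MAX` bytes and glossing over strings that straddle a boundary (`split_offset` is a made-up name):

```rust
// Hedged sketch: map a global u64 output offset to a (buffer index, local
// offset) pair, assuming each output buffer holds at most i32::MAX bytes.
// A real implementation also has to keep any single string within one
// buffer, which shifts the split points.
fn split_offset(global_offset: u64) -> (u32, u32) {
    const MAX_BUFFER_LEN: u64 = i32::MAX as u64;
    (
        (global_offset / MAX_BUFFER_LEN) as u32,
        (global_offset % MAX_BUFFER_LEN) as u32,
    )
}
```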
Summary
This commit implements on-GPU decompression of the existing FSST encoding. The kernel achieves ~42% of max throughput utilization as measured against the `throughput_cuda` benchmark on a DGX Spark. CPU work is still required to compute the output offsets.

The core performance win is buffering up to 24 bytes of decompressed data in three u64 registers and emitting the widest aligned stores possible, up to u128 (`st.global.v2.u64`).
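To make the buffering idea concrete, here is a small CPU-side model of the store strategy; this is illustrative Rust, not the actual CUDA kernel, which keeps the buffer in three u64 registers and flushes via `st.global.v2.u64`:

```rust
// CPU model of the kernel's buffered-store strategy: decoded symbols (up to
// 8 bytes each, held as a u64 plus a length) are packed into a 24-byte
// staging buffer, and once 16+ bytes are pending they are flushed as one
// 16-byte store (the u128 / st.global.v2.u64 case on the GPU).
struct StagedOutput<'a> {
    out: &'a mut Vec<u8>,
    buf: [u8; 24],
    pending: usize,
}

impl StagedOutput<'_> {
    fn push_symbol(&mut self, sym: u64, len: usize) {
        debug_assert!(len <= 8);
        // Always write all 8 symbol bytes; bytes past `len` are scratch and
        // get overwritten by the next symbol, mirroring the kernel's trick
        // of avoiding byte-granular stores.
        self.buf[self.pending..self.pending + 8].copy_from_slice(&sym.to_le_bytes());
        self.pending += len;
        if self.pending >= 16 {
            // Emit the widest store available (16 bytes here), then shift
            // the remaining tail to the front of the staging buffer.
            self.out.extend_from_slice(&self.buf[..16]);
            self.buf.copy_within(16..24, 0);
            self.pending -= 16;
        }
    }

    fn finish(self) {
        // Drain whatever is left with a final narrow store.
        self.out.extend_from_slice(&self.buf[..self.pending]);
    }
}
```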
The 256-entry symbol table (≤ 2 KB) is read directly from global memory. Staging it into shared memory measured ~3% slower at 10M rows and ~15% slower at 1M rows. The hypothesis is that L1 already holds the table after a few iterations and the explicit shared copy adds bank-conflict latency on the warp-divergent `symbols[code]` reads; the gap is wider at 1M because the kernel is less bandwidth-bound there.
Further optimizations would require an encoding change. Splits-style intra-string parallelism (one GPU thread per ~32-compressed-byte chunk instead of per-string) was prototyped on top of this kernel and measured an additional +30% kernel throughput at 1M clickbench URLs, +26% at 5M, +12% at 10M.
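For a sense of what the splits precomputation involves, a hedged sketch of a CPU pass that picks chunk boundaries at code granularity, assuming FSST's standard escape code 255 consumes two compressed bytes (the prototype's actual logic is not part of this PR):

```rust
// Hedged sketch: walk one string's compressed code stream and record
// offsets where a new GPU thread could start decoding, roughly every 32
// compressed bytes, without splitting an escape pair.
const ESCAPE: u8 = 255;
const TARGET_CHUNK: usize = 32;

fn split_points(codes: &[u8]) -> Vec<usize> {
    let mut splits = vec![0];
    let (mut i, mut last) = (0usize, 0usize);
    while i < codes.len() {
        // An escape code is followed by one literal byte; never split between them.
        i += if codes[i] == ESCAPE { 2 } else { 1 };
        if i - last >= TARGET_CHUNK {
            splits.push(i);
            last = i;
        }
    }
    splits
}
```

Each chunk additionally needs its decoded output offset, which is the extra CPU prep discussed in the review thread above and part of why splits read more like motivation for a GSST encoding.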
Four kernel variants are generated for the unsigned widths of `codes_offsets` (u8/u16/u32/u64); signed integer ptypes are reinterpreted as their unsigned equivalents on the Rust side, so the bit pattern is preserved without copying.
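The reinterpretation can be zero-copy along these lines (a sketch for the i32 → u32 case; the PR presumably handles this generically across ptypes):

```rust
// Hedged sketch of the zero-copy sign reinterpretation for one width.
// i32 and u32 share size, alignment, and valid bit patterns, so viewing
// the slice as unsigned is sound and copies nothing.
fn as_unsigned(offsets: &[i32]) -> &[u32] {
    unsafe { std::slice::from_raw_parts(offsets.as_ptr().cast::<u32>(), offsets.len()) }
}
```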
Addresses: #6538
Testing
Unit tests against the CPU implementation on small and larger datasets.